Introduction

Motivation

barcodetrackR is an R package developed for the analysis and visualization of clonal tracking data from cellular barcoding experiments.

Installation

Currently, barcodetrackR is available at Github and can be downloaded using the devtools package.

Contributors

The R package and functions were created by Diego A. Espinoza, Ryland Mortlock, Samson J. Koelle, and others at Cynthia Dunbar’s laboratory at the National Heart, Lung, and Blood Institutes of Health. Issues should be addressed to https://github.com/d93espinoza/barcodetrackR/issues.

Loading data

Loading required packages

The barcodetrackR package operates on SummarizedExperiment objects from the Bioconductor repository. It stores associated colData for each sample in this object as well as any metadata. We load the barcodetrackR and SummarizedExperiment packages here for our analyses, as well as the magrittr package in order to improve legibility of code through using the pipe %>% operator.

Creating objects with create_SE

For this vignette, we will load publically available data from the following papers (these sample datasets are included in the R package):

In addition to reads data, the Six et al. paper includes “estimated abundance” data for each of these timepoints. We load these into a new SE in order to compare our analyses downstream.

Our input dataframes to create the SummarizedExperiment (SE) objects are each an n x m data.frame where there are n rows of observations (typically cellular barcodes, insertion sites, or the like) and the m columns are the samples. The input metadata must have row order identical to the order of the colums in its corresponding dataframe. The metadata must also have a column titled SAMPLENAME that denotes the column of your_data it refers to.

Assays created by create_SE

create_SE takes the input dataframe and metadata and creates a SummarizedExperiment object with the following assays:

  • counts: the raw values from the input dataframe
  • percentages: the per-column proportions of each entry in each column
  • ranks: the rank of each entry in each column
  • normalized: the normalized read values (CPM)
  • logs: the log of the normalized values
## List of length 5
## names(5): counts percentages ranks normalized logs

Correlations

scatter_plot

A straightforward way to view the relationship between samples in a pairwise manner is to view basic scatter plots of two samples using the provided assays. We provide a scatter_plot function as part of the package.

Here, we view the correlation of barcode counts between different cell types at the 20 month timepoint of the Wu et al study. We compare granulocytes (Gr) to B and T cells.

cor_plot

A more comprehensive way to view the relationship between samples in a pairwise manner is to use a correlation plot. Here, we view the Pearson correlation between the T, B, Gr, NK CD56+/CD16-, and NK CD56-/CD16+ fractions within the Wu dataset for the 6, 9.5, 12, and 20 month post-transplant timepoints.

We can also use the spearman correlation as our metric of interest, and visualize our correlation plot using circles with areas proportional to the absolute value of the correlation and colors corresponding to the value of the correlations.

We can return a table of the Pearson correlations as well as the p-values and confidence intervals for each of the comparisons above.

##    sample_i    sample_j correlation_value       p_value     ci_lo     ci_hi
## 1 ZJ31_6m_T   ZJ31_6m_T         1.0000000  0.000000e+00 1.0000000 1.0000000
## 2 ZJ31_6m_T ZJ31_9.5m_T         0.4672994  0.000000e+00 0.4527245 0.4816246
## 3 ZJ31_6m_T  ZJ31_12m_T         0.4406121  0.000000e+00 0.4261147 0.4548832
## 4 ZJ31_6m_T  ZJ31_20m_T         0.3730234  0.000000e+00 0.3564373 0.3893744
## 5 ZJ31_6m_T   ZJ31_6m_B         0.2263901 7.739959e-201 0.2122372 0.2404479
## 6 ZJ31_6m_T ZJ31_9.5m_B         0.2346975 2.001192e-191 0.2197056 0.2495785

Above, we used two of the three available correlation visualizations ("color" and "circle") using the standard color palette provided. A no_negative parameter is offered as well to eliminate negative correlations within the data, from which deriving biological meaning may be difficult.

When there are a smaller number of samples to analyze, the "number" option can be used to view the actual correlation within the grid. Here is an example comparing the pearson correlation of clones within peripheral blood at subsequent timepoints from the belberdos sample data set.

Clonal patterns

barcode_ggheatmap

A useful visualization to study clonal patterns over time is by using a heat map which clusters the top clones based on relatedness and displays their proportion in each sample over time. Our function barcode_ggheatmap does this by choosing the top N clones (n_clones) within each sample and tracking them over time. The argument n_clones assumes that in most cases, the large-contributing clones are of most interest to the user. This assumption can be relaxed by passing a large value to the argument.

We first visualize the top 10 clones from the selected samples in the Wu dataset. The default cell note is stars for the top 10 clones in each sample.

We can also visualize the percentage contribution of each of these clones and add a dendogram which clusters clones based on the Euclidean distance between each clone’s log assay. Here we plot the top 5 clones within the Six dataset. First, we order the columns to group them by celltype.

Splitting the clones into 4 clusters makes it easy to identify clones which are biased towards Monocytes (purple), T cells (green), NK cells (light blue), and B cell / Granulocyte (red).

The Belderbros dataset generally contains less clones than either the Wu or Six datasets, likely because the data is from mice. When possible, using the “percentages” option of the cellnote_assay aqrgument provides maximal information.

Compared to the other data sets, this one generally shows a smaller number of clones making up a large proportion of the hematopoiesis.

barcode_binary_heatmap

In some cases, we may be interested in a global view of the presence or absence of barcodes across samples, regardless of read abundance. In that case, a binary heat map can be generated using barcode_binary_heatmap to give the simplest visual representation. Here we view the binary heat map of the belderbos data with a threshold of 0.01, meaning clones that make up less than 1% of a sample are treated as not detected.

clonal_contribution

Another familiar way to visualize clonal patterns over time is using a line or bar chart showing the proportion of top clones. In the above heat maps for wu data, we could see that there are some large uni-lineage CD56-/CD16+ NK cell clones. We can view the expansion of the top clones from the final timepoint through a stacked area line chart showing the proportion of each clone in CD56-/CD16+ NK cell samples across time.

The colored clones represent the top 10 clones in the 20 month CD56-/CD16+ NK sample. The gray clones are other smaller clones which are found in the CD16+ samples at any of the four timepoints.

For data with fewer clones, a bar chart might be appropriate. We can view the peripheral blood samples from the belderbos dataset, previously shown as a heat map. By turning the plot_non_selected argument to false, we can only look at the top 10 clones from the chosen sample and ignore. We can also use categorical spacing on the x-axis rather than numeric by setting keep_numeric to false.

Clonal Bias

ridge_plot

The ridge plot function calculates the density of clones at different values of bias between two selections. Note that log_bias is calculated as log(selection_1/selection_2 + 1) so that it can handle clones which are zero in one of the samples. If the "plot_by" argument is provided, the ridge plot will have a y axis corresponding to samples along the variable provided which must be a column of colData.

Here, we view a ridge plot showing clonal bias between B and T cells from the Wu dataset for the 6, 9.5, 12, and 20 month post-transplant timepoints.

By default, the ridge plot calculates the density of clones at each value of the log bias. It does not consider the fractional contribution of the clones. If you want larger clones to be more influential to the density estimation, set weighted = TRUE to make a weighted heat map.

For this dataset, the weighted heat map shows a more balanced (less biased) clonality between B and T cells as compared to the regular ridge plot. This means that although there are highly biased clones, the larger clones tend to be shared between B and T cells.

Creating a ridge plot based on the Six et al data also shows that the clonality between B and T cells (as assessed by viral integration site analysis) becomes less biased over time.

bias_histogram

bias_distribution_barplot

ternary_plot

Diversity

Circos

Utilities